On ergodic two-armed bandits

Authors

  • Pierre Tarrès
  • Pierre Vandekerkhove
Abstract

A device has two arms with unknown deterministic payoffs, and the aim is to asymptotically identify the best one without spending too much time on the other. The Narendra algorithm offers a stochastic procedure to this end. We show, under weak ergodicity assumptions on these deterministic payoffs, that the procedure eventually chooses the best arm (i.e. the one with the greatest Cesàro limit) with probability one, for appropriate step sequences of the algorithm. In the case of i.i.d. payoffs, this implies a "quenched" version of the "annealed" result of Lamberton, Pagès and Tarrès (2004) [6] by the law of the iterated logarithm, thus generalizing it. More precisely, if $(\eta_{l,i})_{i\in\mathbb{N}} \in \{0,1\}^{\mathbb{N}}$, $l \in \{A,B\}$, are the deterministic reward sequences we would obtain by playing arm $l$ at time $i$, we obtain infallibility under the same assumption on nonincreasing step sequences as in [6], with the i.i.d. assumption on the payoffs replaced by the hypothesis that the empirical averages $\frac{1}{n}\sum_{i=1}^{n} \eta_{A,i}$ and $\frac{1}{n}\sum_{i=1}^{n} \eta_{B,i}$ converge, as $n$ tends to infinity, to $\theta_A$ and $\theta_B$ respectively, at rate at least $1/(\log n)^{1+\varepsilon}$ for some $\varepsilon > 0$.

University of Oxford, Mathematical Institute, 24-29 St Giles, Oxford OX1 3LB, United Kingdom, [email protected]. Supported in part by Swiss National Foundation Grant 200021-1036251/1 and by a Leverhulme Prize.

Université Paris-Est Marne-la-Vallée, LAMA, 5 boulevard Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, [email protected].

AMS 2000 Subject Classifications: Primary 62L20; secondary 93C40, 91E40, 68T05, 91B32.
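The abstract does not reproduce the algorithm itself. For orientation, here is a minimal Python sketch of the linear reward-inaction update commonly associated with the Narendra scheme studied in [6]: arm A is played with probability $x_n$, a unit payoff moves $x_n$ towards the arm just played by a step $\gamma_n$, and a zero payoff leaves it unchanged. The step sequence $\gamma_n = 1/(n+2)$, the initial value $x_0 = 1/2$, and all function and variable names are illustrative choices of ours, not taken from the paper.

```python
import random

def narendra_two_armed(eta_A, eta_B, gamma, x0=0.5, seed=None):
    """Linear reward-inaction scheme on two deterministic payoff sequences.

    eta_A, eta_B : sequences in {0, 1}; eta_l[i] is the payoff arm l
                   would give if it were played at time i.
    gamma        : nonincreasing step sequence with values in (0, 1).
    Returns the trajectory (x_n) of the probability of playing arm A.
    """
    rng = random.Random(seed)
    x = x0
    traj = [x]
    for n in range(len(gamma)):
        play_A = rng.random() < x       # arm A is chosen with probability x_n
        payoff = eta_A[n] if play_A else eta_B[n]
        if payoff == 1:                 # "reward": move x_n towards the arm just played
            x = x + gamma[n] * (1 - x) if play_A else x * (1 - gamma[n])
        # "inaction": x_n is left unchanged when the payoff is 0
        traj.append(x)
    return traj

# Illustrative run: two 0/1 sequences sampled once and then treated as
# deterministic, with empirical averages converging to theta_A = 0.7 and
# theta_B = 0.4. Infallibility means x_n -> 1: arm A is eventually always played.
N = 10_000
gen = random.Random(0)
eta_A = [int(gen.random() < 0.7) for _ in range(N)]
eta_B = [int(gen.random() < 0.4) for _ in range(N)]
steps = [1.0 / (n + 2) for n in range(N)]
print(narendra_two_armed(eta_A, eta_B, steps, seed=1)[-1])
```

Since such i.i.d.-sampled sequences satisfy the required convergence rate almost surely (by the law of the iterated logarithm), the final value printed should typically be close to 1, matching the infallibility statement above.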

Related articles

Semi-Bandits with Knapsacks

We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...

Modal Bandits

Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.

Strategic Exit with Random Observations

In standard optimal stopping problems, actions are artificially restricted to the moments of observations of costs or benefits. In standard experimentation and learning models based on two-armed Poisson bandits, it is possible to take an action between two sequential observations. The latter models do not recognize that the timing of decisions depends not only on the rate of arriva...

Sequential Monte Carlo Bandits

In this paper we propose a flexible and efficient framework for handling multi-armed bandits, combining sequential Monte Carlo algorithms with hierarchical Bayesian modeling techniques. The framework naturally encompasses restless bandits, contextual bandits, and other bandit variants under a single inferential model. Despite the model’s generality, we propose efficient Monte Carlo algorithms t...

A Survey on Contextual Multi-armed Bandits

4 Stochastic Contextual Bandits
4.1 Stochastic Contextual Bandits with Linear Realizability Assumption
4.1.1 LinUCB/SupLinUCB
4.1.2 LinREL/SupLinREL
4.1.3 CofineUCB
4.1.4 Thompson Sampling with Linear Payoffs...

Reducing Dueling Bandits to Cardinal Bandits

We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form “A is preferred to B” (as opposed to cardinal feedback like “A has value 2.5”), giving it wide applicability in learning from implicit user feedback and revealed and stated prefer...



Publication date: 2009